NETWORK INTERFACE AND PROTOCOL

This invention relates to a network interface and a protocol for use in passing data
over a network.



When data is to be transferred between two devices over a data channel, each of the
devices must have a suitable network interface to allow it to communicate across the
channel. The devices and their network interfaces use a protocol to form the data
that is transmitted over the channel, so that it can be decoded at the receiver. The
data channel may be considered to be or to form part of a network, and additional
devices may be connected to the network.

The Ethernet system is used for many networking applications. Gigabit Ethernet is a
high-speed version of the Ethernet protocol, which is especially suitable for links that
require a large amount of bandwidth, such as links between servers or between data
processors in the same or different enclosures. Devices that are to communicate
over the Ethernet system are equipped with network interfaces that are capable of
supporting the physical and logical requirements of the Ethernet system. The
physical hardware components of network interfaces are referred to as network
interface cards (NICs), although they need not be in the form of cards: for instance
they could be in the form of integrated circuits (ICs) and connectors fitted directly on
to a motherboard.

Where data is to be transferred between cooperating processors in a network, it is
common to implement a memory mapped system. In a memory mapped system
communication between the applications is achieved by virtue of a portion of one
application's virtual address space being mapped over the network onto another
application. The "holes" in the address space which form the mapping are termed
apertures.

Figure 1 illustrates a mapping of the virtual address space (X0-Xn) onto another
virtual address space (Y0-Yn) via a network. In such a system a CPU that has access
to the X0-Xn memory space could access a location x1 for writing the contents of a
register r1 to that location by issuing the store instruction [st r1, x1]. A memory
mapping unit (MMU) is employed to map the virtual memory onto physical memory
location.



The following steps would then be taken:

1. CPU emits the contents of r1 (say value 10) as a write operation to virtual
address x1.

2. The MMU (which could be within the CPU) turns the virtual address x1 into
physical address pci1 (this may include page table traversal or a page fault).

3. The CPU's write buffer emits the "write 10, pci1" instruction which is "caught"
by the controller for the bus on which the CPU is located, in this example a
PCI (Input/Output bus subsystem) controller. The instruction is then forwarded
onto the computer's PCI bus.

4. A NIC connected to the bus and interfacing to the network "catches" the PCI
instruction and forwards the data to the destination computer at which virtual
address space (Y0-Yn) is hosted.

5. At the destination computer, which is assumed to have equivalent hardware,
the network card emits a PCI write transaction to store the data in memory.

6. The receiving application has a virtual memory mapping onto the memory and
may read the data by executing a "load y1" instruction.



These steps are illustrated by figure 2. This figure illustrates that at each point that
the hardware store instruction passes from one hardware device to another, a
translation of the address from one address space to another may be required. Also
note that a very similar chain of events supports read operations and PCI is assumed
but not required as the host IO bus implementation.



10/21/05, EAST Version: 2.0.1.4

WO 2004/025477    PCT/GB2003/003971

Hence the overall memory space mapping {X0 - Xn} -> {Y0 - Yn} is implemented by a
series of sub-mappings as follows:

{X0 - Xn}
->
{PCI0 - PCIn} (processor 1 address space)
->
{PCI'0 - PCI'n} (PCI bus address space)
->
Network - mapping not shown
->
{PCI"0 - PCI"n} (destination PCI bus address space)
->
{mem0 - memn} (destination memory address space)
->
{Y0 - Yn} (destination application's virtual address space)

The step marked in figure 2 as "Network" requires the NIC / network controller to
forward the transaction to the correct destination host in such a way that the
destination can continue the mapping chain. This is achieved by means of further
memory apertures.
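
The chain of sub-mappings can be pictured as a list of apertures applied one after
another. The following is only a minimal sketch: the class name Aperture, the
function translate_chain and all base addresses are hypothetical and chosen for
illustration, and the network hop is omitted just as it is in the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Aperture:
    """One sub-mapping: [src_base, src_base + size) -> [dst_base, dst_base + size)."""
    src_base: int
    dst_base: int
    size: int

    def translate(self, addr: int) -> int:
        # Validity check: an address outside the aperture is rejected.
        if not (self.src_base <= addr < self.src_base + self.size):
            raise ValueError(f"address {addr:#x} outside aperture")
        return self.dst_base + (addr - self.src_base)


def translate_chain(addr: int, chain: list[Aperture]) -> int:
    """Apply each sub-mapping in turn, e.g. X -> PCI -> PCI' -> mem."""
    for aperture in chain:
        addr = aperture.translate(addr)
    return addr


# Hypothetical base addresses, for illustration only.
chain = [
    Aperture(src_base=0x1000, dst_base=0x8000, size=0x100),   # X -> PCI
    Aperture(src_base=0x8000, dst_base=0x20000, size=0x100),  # PCI -> PCI'
    Aperture(src_base=0x20000, dst_base=0x4000, size=0x100),  # PCI' -> mem
]
```

With these values, translate_chain(0x1010, chain) yields 0x4010, while an address
such as 0x2000 fails the first aperture check and is rejected.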



Two main reasons for the use of aperture mappings are:

a) System robustness. At each point that a mapping is made, hardware may check
the validity of the address by matching against the valid aperture tables. This
guards against malicious or malfunctioning devices. The amount of protection
obtained is the ratio of the address space size (in bits) and the number of
allocated apertures (total size in bits).

b) Conservation of address bits. Consider the case that a host within a 32 bit
memory bus requires to access two 32 bit PCI buses. On the face of it, there
would not be sufficient bits, since a minimum of 33 should be needed. However,
the use of apertures enables a subset of both PCI buses to be accessed:
{PCI0 - PCIn} -> {PCI'0 - PCI'n} and
{PCIp - PCIq} -> {PCI"p - PCI"q}

Hardware address mappings and apertures are well understood in the area of virtual
memory and I/O bus mapping (e.g. PCI). However there are difficulties in
implementing mappings that are required over a network. The main issues are:

1. A large number of apertures may be required to be managed, because one
host must communicate with many others over the network.

2. In network situations it is normal, for security reasons, to treat each host as its
own protection and administrative domain. As a result, when a connection
between two address spaces is to be made over a network, not all the
aperture mappings along the path can be set up by the initiating host. Instead
a protocol must be devised which allows all the mappings within each
protection domain to be set by the "owner" of the domain in such a way that an
end-end connection is established.

3. For security reasons it is normal for hosts within a network to not entirely trust
the others, and so the mapping scheme should prevent arbitrary faulty or
malicious hosts from corrupting another host's memory.



Traditionally (e.g. for Ethernet or ATM switching), protocol stacks and network drivers
have resided in the kernel. This has been done to enable:

1. the complexity of network hardware to be hidden via a higher level interface;

2. the safe multiplexing of network hardware and other system resources (such
as memory) over many applications;

3. the security of the system against faulty or malicious applications.

In the operation of a typical kernel stack system a hardware network interface card
interfaces between a network and the kernel. In the kernel a device driver layer
communicates directly with the NIC, and a protocol layer communicates with the
system's application level.

The NIC stores pointers to buffers for incoming data supplied to the kernel and
outgoing data to be applied to the network. These are termed the Rx data ring and
the Tx data ring. The NIC updates a buffer pointer indicating the next data on the Rx
buffer ring to be read by the kernel. The Tx data ring is supplied by direct memory
access (DMA) and the NIC updates a buffer pointer indicating the outgoing data
which has been transmitted. The NIC can signal to the kernel using interrupts.



Incoming data is picked off the Rx data ring by the kernel and is processed in turn.
Out of band data is usually processed by the kernel itself. Data that is to go to an
application-specific port is added by pointer to a buffer queue, specific to that port,
which resides in the kernel's private address space.

The following steps occur during operation of the system for data reception:

1. During system initialization the operating system device driver creates kernel
buffers and initializes the Rx ring of the NIC to point to these buffers. The OS
also is informed of its IP host address from configuration scripts.

2. An application wishes to receive network packets and creates a Port which is
a queue-like data structure residing within the operating system. It has a
port number which is unique to the host in such a way that network packets
addressed by <host:port> can be delivered to the port's queue.

3. A packet arrives at the network interface card (NIC). The NIC copies the
packet over the host I/O bus (e.g. a PCI bus) to the memory address pointed
to by the next valid Rx DMA ring Pointer value.

4. Either if there are no remaining DMA pointers available, or on a pre-specified
timeout, the NIC asserts the I/O bus interrupt in order to notify the host that
data has been delivered.

5. In response to the interrupt, the device driver examines the buffer delivered
and if it contains valid address information, such as a valid host address,
passes a pointer to the buffer to the appropriate protocol stack (e.g. TCP/IP).

6. The protocol stack determines whether a valid destination port exists and if so,
performs network protocol processing (e.g. generate an acknowledgement for
the received data) and enqueues the packet on the port's queue.

7. The OS may indicate to the application (e.g. by rescheduling and setting bits
in a "select" bit mask) that a packet has arrived on the network end point to
which the port is bound. (By marking the application as runnable and invoking
a scheduler).

8. The application requests data from the OS, e.g. by performing a recv() system
call (supplying the address and size of a buffer) and whilst in the OS kernel,
data is copied from the kernel buffer into the application's buffer. On return
from the system call, the application may access the data from the application
buffer.

9. After the copy, the kernel will return the kernel buffer to an O/S pool of free
memory. Also, during the interrupt the device driver allocates a new buffer
and adds a pointer to the DMA ring. In this manner there is a circulation of
buffers from the free pool to an application's port queue and back again.

10. An important property of the kernel buffers is that they are contiguous in
physical RAM and are never paged out by the VM system. However, the free
pool may be shared as a common resource for all applications.
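
The reception steps above can be modelled in a few lines. This is a toy sketch
rather than a description of any real operating system: the class KernelRxPath and
its method names are hypothetical, the timeout in step 4 is omitted, and buffer
recycling (step 9) is not modelled. The counter simply records how many times the
application has to enter the kernel.

```python
from collections import deque


class KernelRxPath:
    """Toy model of kernel-stack reception; all names are illustrative."""

    def __init__(self, host: str, ring_size: int = 4):
        self.host = host
        self.ring_size = ring_size
        self.rx_ring = deque()       # buffers filled by the NIC, awaiting the driver
        self.ports = {}              # port number -> queue of payloads (kernel memory)
        self.context_switches = 0    # user-to-kernel transitions made by the application

    def create_port(self, port: int) -> None:
        self.context_switches += 1   # step 2: creating a port is a system call
        self.ports[port] = deque()

    def nic_deliver(self, dst_host: str, dst_port: int, payload: bytes) -> None:
        # Step 3: the NIC copies the packet into the next ring buffer.
        self.rx_ring.append((dst_host, dst_port, payload))
        # Step 4: interrupt once no ring slots remain (timeout omitted).
        if len(self.rx_ring) >= self.ring_size:
            self.interrupt()

    def interrupt(self) -> None:
        # Steps 5-6: check the host address, then enqueue on the port's queue.
        while self.rx_ring:
            dst_host, dst_port, payload = self.rx_ring.popleft()
            if dst_host == self.host and dst_port in self.ports:
                self.ports[dst_port].append(payload)

    def recv(self, port: int, size: int) -> bytes:
        # Step 8: a system call; data is copied out of the kernel buffer.
        self.context_switches += 1
        queue = self.ports[port]
        if not queue:
            return b""
        return bytes(queue.popleft()[:size])
```

Even in this reduced form, every recv() call costs a transition into the kernel,
whether or not any data is waiting.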

For data transmission, the following steps occur:

1. The operating system device driver creates kernel buffers for use for
transmission and initializes the Tx ring of the NIC.

2. An application that is to transmit data stores that data in an application buffer
and requests transmission by the OS, e.g. by performing a send() system call
(supplying the address and size of the application buffer).

3. In response to the send() call, the OS kernel copies the data from the
application buffer into the kernel buffer and applies the appropriate protocol
stack (e.g. TCP/IP).

4. A pointer to the kernel buffer containing the data is placed in the next free slot
on the Tx ring. If no slot is available, the buffer is queued in the kernel until the
NIC indicates e.g. by interrupt that a slot has become available.

5. When the slot comes to be processed by the NIC it accesses the kernel buffer
indicated by the contents of the slot by DMA cycles over the host IO bus and
then transmits the data.
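
The transmit path can be sketched in the same spirit. The class TxRing and its
members are hypothetical names, protocol processing is left out, and the "slot
available" interrupt is reduced to refilling the ring from a backlog queue.

```python
from collections import deque


class TxRing:
    """Toy model of the kernel transmit path; all names are illustrative."""

    def __init__(self, slots: int):
        self.slots = slots
        self.ring = deque()      # kernel buffers the NIC has yet to transmit
        self.backlog = deque()   # buffers queued in the kernel for a free slot
        self.wire = []           # data the NIC has transmitted, in order

    def send(self, app_buffer: bytearray) -> None:
        # Step 3: copy the application buffer into a kernel buffer.
        kernel_buffer = bytes(app_buffer)
        # Step 4: use the next free slot, or queue until one becomes available.
        if len(self.ring) < self.slots:
            self.ring.append(kernel_buffer)
        else:
            self.backlog.append(kernel_buffer)

    def nic_process_slot(self) -> None:
        # Step 5: the NIC reads the kernel buffer by DMA and transmits it.
        if not self.ring:
            return
        self.wire.append(self.ring.popleft())
        # A slot is now free, so the kernel can move a queued buffer onto the ring.
        if self.backlog:
            self.ring.append(self.backlog.popleft())
```

Because send() takes its own copy, the application can overwrite its buffer as
soon as the call returns without affecting what is eventually transmitted.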

Considering the data movement through the system, it should be noted that in the
case of data reception the copy of data from the kernel buffer into the application
buffer actually leaves the data residing in the processor's cache and so even though
there appears to be a superfluous copy using this mechanism, the copy actually
serves as a cache load operation. In the case of data transmission, it is likely that
the data that is to be transmitted originated from the cache before being passed to
the application for transmission, in which case the copy step is obviously inefficient.
There are two reasons for the copy step:

1. To ensure the data is in pinned down kernel memory during the time that the
NIC is copying the data in or out. This applies when data is being transmitted
or received and also has the benefit that the application is not able to tamper
or damage the buffer following kernel protocol processing.

2. In the case of transmission it is often the case that the send() system call
returns successfully before the data has been actually transmitted. The copy
step enables the OS to retain the data in case retransmission is required by
the protocol layer.

Even if the copy step were omitted, on data reception, a cache load would take place
when the application accessed the kernel buffer. Many people have recognized
(see, e.g. US 6,246,683) that these additional copies have been the cause of
performance degradation. However, the solutions presented so far have all involved
some excess data movement. It would be desirable to reduce this overhead. The
inventors of the present invention have recognised that an overlooked problem is not
the copying, but the user to kernel context switching and interrupt handling
overheads. US 6,246,683, for instance, does nothing to avoid these overheads.

During a context switch on a general purpose operating system many registers have
to be saved and restored, and TLB entries and caches may be flushed. Modern
processors are heavily optimized for sustained operation from caches and
architectural constraints (such as the memory gap) are such that performance in the
face of large numbers of context switches is massively degraded. Further discussion
of this is given in "Piglet: A Low-Intrusion Vertical Operating System", S. J. Muir and
J. M. Smith, Tech. rep. MS-CIS-00-04, Univ. of PA, Jan. 2000. Hence it would be
desirable to reduce context switches during both data transfer and connection
management.






In order to remove the cost of context switches from data transmission and reception,
VIA (Virtual Interface Architecture) was developed as an open standard from
academic work in U-NET. Further information is available in the Virtual Interface
Architecture Specification available from www.vidf.org. Some commercial
implementations were made and it has since evolved into the Infiniband standard.
The basic principle of this system is to enhance the network interface hardware to
provide each application (network endpoint) with its own pair of DMA queues (Tx and
Rx). The architecture comprises a kernel agent, a hardware NIC, and a
user/application level interface. Each application at the user level is given control of
a VI (Virtual Interface). This comprises two queues, one for transmission, one for
reception (and an optional CQ completion queue). To transmit some data on a VI,
the application must:

1. Ensure data is in a buffer which has been pinned down. (This would require a
system call to allocate.)

2. Construct a descriptor which contains a buffer pointer and length and add a
pointer to the descriptor onto the send queue. (N.B. the descriptor must also
be in pinned down memory).

3. If necessary the application indicates that the work queue is active by writing
to a hardware "doorbell" location on the NIC which is associated with the VI
endpoint.

4. At some time later the NIC processes the send queue and DMAs the data
from the buffer, forms a network packet and transmits to the receiver. The
NIC will then mark the descriptor associated with the buffer to indicate that the
data has been sent.



It is possible to also associate the VI with a completion queue. If this has been done
the NIC will post an event to the completion queue to indicate that the buffer has
been sent. Note that this is to enable one application to manage a number of VI
queues by looking at the events on only one completion queue.

Either the VI send queue or completion queue may be enabled for interrupts.

To receive a buffer the application must create a descriptor which points to a free
buffer and place the descriptor on the receive queue. It may also write to the
"doorbell" to indicate that the receive queue is active.

When the NIC has a packet which is addressed to a particular VI receive queue, it
reads the descriptor from the queue and determines the Rx buffer location. The NIC
then DMAs the data to the receive buffer and indicates reception by:

1. marking the descriptor

2. generating an event on the completion queue (if one has been associated with
the VI)

3. generating an interrupt if one has been requested by the application or kernel
(which marks either the Rx queue or completion queue).
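
The descriptor and queue handling described above can be sketched as follows.
This is not the VIA programming interface: Descriptor, VirtualInterface and
nic_transfer are hypothetical names, memory pinning and interrupts are not
modelled, and the network is reduced to a direct copy between two endpoints.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Descriptor:
    buffer: bytearray      # stands in for a pinned buffer
    length: int = 0        # number of valid bytes in the buffer
    done: bool = False     # marked by the NIC when the transfer completes


class VirtualInterface:
    """One endpoint with a send queue, a receive queue and an optional CQ."""

    def __init__(self, completion_queue=None):
        self.send_queue = deque()
        self.recv_queue = deque()
        self.completion_queue = completion_queue
        self.doorbell_rung = False

    def post_send(self, data: bytes) -> Descriptor:
        desc = Descriptor(bytearray(data), len(data))
        self.send_queue.append(desc)
        self.doorbell_rung = True   # doorbell write: the work queue is active
        return desc

    def post_recv(self, size: int) -> Descriptor:
        desc = Descriptor(bytearray(size))
        self.recv_queue.append(desc)
        self.doorbell_rung = True
        return desc


def nic_transfer(src: VirtualInterface, dst: VirtualInterface) -> bool:
    """Move one message from src's send queue into dst's next receive buffer."""
    if not src.send_queue or not dst.recv_queue:
        return False                # nothing to send, or no receive buffer posted
    tx, rx = src.send_queue.popleft(), dst.recv_queue.popleft()
    n = min(tx.length, len(rx.buffer))
    rx.buffer[:n] = tx.buffer[:n]   # the "DMA" into the receive buffer
    rx.length = n
    tx.done = rx.done = True        # mark both descriptors
    for vi, kind in ((src, "sent"), (dst, "received")):
        if vi.completion_queue is not None:
            vi.completion_queue.append((kind, vi))
    return True
```

In this sketch a transfer cannot proceed until the receiver has posted a buffer,
which is why the application has to look after flow control itself.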



There are problems with this queue based model:

1. Performance for small messages is poor because of descriptor overheads.

2. Flow control must be done by the application to avoid receive buffer
overrun.

Also, VIA does not avoid context switches for connection setup and has no error
recovery capabilities. This is because it was intended to be used within a cluster
where there are long-lived connections and few errors. If an error occurs, the VI
connection is simply put into an error state (and usually has to be torn down and
recreated).

VIA connection setup and tear down proceeds using the kernel agent in exactly the
same manner as described for kernel stack processing. Hence operations such as
Open, Connect, Accept etc all require context switches into the kernel. Thus in an
environment where connections are short lived (e.g. WWW) or errors are frequent
(e.g. Ethernet) the VIA interface performs badly.

VIA represents an evolution of the message passing interface, allowing user level
access to hardware. There has also been another track of developments supporting
a shared memory interface. Much of this research was targeted at building large
single operating system NUMA (non-uniform memory architecture) machines (e.g.
Stanford DASH) where a large supercomputer is built from a number of processors,
each with local memory, and a high-speed interconnect. For such machines,
coherency between the memory of each node was maintained by the hardware
(interconnect). Coherency must generally ensure that a store/load operation on
CPU1 will return the correct value even where there is an intervening store operation
on CPU2. This is difficult to achieve when CPU1 is allowed to cache the contents of
the initial store and would be expected to return the cached copy if an intervening
write had not occurred. A large part of the IEEE standard for SCI (Scalable Coherent
Interconnect) is taken up with ensuring coherency. The standard is available from
www.vesa.org.

Because of the NUMA and coherency heritage of shared memory interconnects, the
management and failure modes of the cluster were that of a single machine. For
example implementations often assumed:

1. Single network address space which was managed by a trusted cluster-wide
service (e.g. DEC's Memory Channel: R Gillett, Memory Channel Network
for PCI, IEEE Micro 16(2), 12-18, Feb 96).

2. No protection (or no page level protection) measures to protect local host
memory from incoming network access (e.g. SCI).

3. Failure of a single processor node causing failure of whole "machine" (e.g.
SCI).

In a Memory Channel implementation of a cluster wide connection service,
physically, all network writes are passed to all nodes in the cluster at the same time,
so a receiving node just matches writes against its incoming window. This method
does provide incoming protection (as we do), but address space management
requires communication with the management node and is inefficient.

SCI is similar except that 12 bits of the address space is dedicated to a node
identifier so that writes can be directed to a particular host. The remaining 48 bits is
host implementation dependent. Most implementations simply allocate a single large
segment of local memory and use the 48 bits as an offset.

No implementation has addressed ports or distributed connection management as
part of its architecture.

Some implementations provide an event mechanism, where an event message can
be sent from one host to another (or from network to host). When these are software
programmable, distributed connection set up using ports is possible. However, since
these mechanisms are designed for (rare) error handling (e.g. where a cable is
unplugged), the event queue is designed to be a kernel only object - hence context
switches are still required for connection management in the same manner as the
VIA or kernel stack models.



According to one aspect of the present invention there is provided a communication
interface for providing an interface between a data link and a data processor, the
data processor being capable of supporting an operating system and a user
application, the communication interface being arranged to: support a first queue of
data received over the link and addressed to a logical data port associated with a
user application; support a second queue of data received over the link and identified
as being directed to the operating system; and analyse data received over the link
and identified as being directed to the operating system or the data port to determine
whether that data meets one or more predefined criteria, and if it does meet the
criteria transmit an interrupt to the operating system.

Conveniently the user application has an address space and the first queue is
located in that address space. Conveniently the operating system has an address
space and the second queue is located in that address space. Most conveniently at
least part of the address space of the user application is the same as at least part of
the address space of the operating system. Preferably all the address space of
the user application lies within the address space of the operating system.

The communication interface is preferably arranged to apply to the first queue data
received over the link and identified as being directed to the data port. The
communication interface is preferably arranged to apply to the second queue data
received over the link and identified as being directed to the operating system.

Preferably one of the predefined criteria is such that if the data received over the link
matches one or more predetermined message forms then the communication
interface will transmit an interrupt to the operating system.

Preferably the communication interface is arranged to, if the data meets one or more
of the predefined criteria and one or more additional criteria, transmit an interrupt to
the operating system and transmit a message to the operating system indicating a
port to which the data was addressed. Preferably the additional criteria are indicative
of an error condition.



Preferably the communication interface is arranged to support a third queue of data
received over the link and addressed to a logical data port associated with a user
application, and is arranged to apply to the first queue data units received over the
link and of a form having a fixed length and to apply to the third queue data units
received over the link and of a form having a variable length. Preferably the data
units of a fixed size include messages received over the link and interpreted by the
communication interface as indicating an error status. Preferably the data units of a
fixed size include or may include messages received over the link and interpreted by
the communication interface as indicating a request for or acknowledgement of
set-up of a connection. Preferably the data units of a fixed size include messages
received over the link and interpreted by the communication interface as indicating a
data delivery event.

Preferably the communication interface is arranged to analyse the content of each
data unit received over the link and to determine in dependence on the content of
that data unit which of the said queues to apply the data unit to.

Preferably the communication interface is configurable by the operating system to set
the said criteria.

Preferably one or both of the communication interface and the operating system is
responsive to a message of a predetermined type to return a message including
information indicative of the status of the port.

According to the present invention there is also provided a communication system
including a communication interface as set out above and the said data processor.

The data processor is preferably arranged to, when the processing of an application
with which a data port is associated is suspended, set the criteria such that the
communication interface will transmit an interrupt to the operating system on
receiving data identified as being directed to that data port.
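
As a rough illustration of the queue selection and interrupt criteria, the sketch
below routes received data either to a per-port queue or to an operating system
queue, and raises an interrupt for a port only while the operating system has
asked for one (for example because the owning application is suspended). The
names CommInterface and set_interrupt_on are hypothetical, and treating unknown
ports as operating system traffic that always interrupts is an assumption made
for this sketch, not something stated in the text.

```python
from collections import deque


class CommInterface:
    """Toy model of queue demultiplexing with OS-configurable interrupt criteria."""

    def __init__(self):
        self.port_queues = {}          # first queue: one per user-level data port
        self.os_queue = deque()        # second queue: data directed to the OS
        self.interrupt_ports = set()   # criteria configured by the OS
        self.interrupts = []           # record of interrupts raised, in order

    def open_port(self, port: int) -> None:
        self.port_queues[port] = deque()

    def set_interrupt_on(self, port: int, enabled: bool) -> None:
        # Called by the OS, e.g. when the application is suspended or resumed.
        if enabled:
            self.interrupt_ports.add(port)
        else:
            self.interrupt_ports.discard(port)

    def receive(self, port, data: bytes) -> None:
        if port is None or port not in self.port_queues:
            # Assumption: OS-directed or unknown-port data always interrupts.
            self.os_queue.append(data)
            self.interrupts.append("os")
            return
        self.port_queues[port].append(data)   # delivered without OS involvement
        if port in self.interrupt_ports:
            self.interrupts.append(port)      # criteria met: notify the OS
```

While the application is running, data for its port arrives on its queue with no
interrupt; once the criteria are set for that port, the same traffic also notifies
the operating system.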



According to a second aspect of the present invention there is provided a
communication interface for providing an interface between a data link and first data
processing apparatus including a memory, the data interface being such that a region
of the memory of the first data processing apparatus can be mapped on to memory
of a second data processing apparatus connected to the communication interface by
the link, the communication interface being arranged to, on establishing a mapping of
a first range of one or more memory locations in the second data processing
apparatus on to a second range of one or more memory locations in the first data
processing apparatus, transmit to the second data processing apparatus data
identifying the first range of memory locations.

Preferably the memory of the second data processing apparatus is virtual memory.
Preferably the memory locations in the memory of the second data processing
apparatus are virtual memory locations. Most preferably said one or more memory
locations in the memory of the first data processing apparatus are one or more virtual
memory locations and the communication interface is arranged to, on establishing
the said mapping, establish a further mapping of the one or more virtual memory
locations on to one or more physical memory locations in the memory of the first data
processing apparatus.







Preferably the communication interface is arranged to, on establishing a mapping of
a first range of one or more memory locations in the memory of the second data
processing apparatus on to a second range of one or more memory locations in the
memory of the first data processing apparatus, allocate an identity to that mapping
and transmit that identity to the second data processing apparatus.

Preferably the communication interface is capable of communicating by means of
data messages which specify a destination port to which data they contain is to be
applied.

Preferably the communication interface is arranged to, on establishing a mapping of
a first range of one or more memory locations in the memory of the second data
processing apparatus on to a second range of one or more memory locations in the
memory of the first data processing apparatus, determine check data and transmit
the check data to the second data processing apparatus, and wherein the
communication interface is arranged to reject subsequent communications over the
mapping which do not indicate the check data. Preferably the check data is
randomly generated by the communication interface. Conveniently, to indicate the
check data a communication includes the check data.

Preferably the communication interface is arranged to modify the check data,
according to a predefined scheme, during the operation of the mapping. Then it is
subsequent communications over the mapping that do not indicate that modified data
that the communication interface is preferably arranged to reject. Preferably the
check data represents a number and the predefined scheme is to increment the
number represented by the check data by a predefined amount each time a
predefined number of communications over the mapping is accepted. The
predefined amount is preferably but not necessarily one. The predefined number is
preferably but not necessarily one.

Preferably the communication interface is arranged to reject subsequent
communications over the mapping which indicate a request for accessing data
outside the first range.
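
The check data scheme can be illustrated with a short sketch. The class Mapping,
the 32-bit width of the check value and the field names are assumptions made for
illustration; the text only requires that the check data be randomly generated,
that it be incremented by a predefined amount after a predefined number of
accepted communications, and that out-of-range accesses be rejected.

```python
import secrets


class Mapping:
    """Toy model of a mapping protected by rolling check data and a range check."""

    def __init__(self, start: int, end: int, increment: int = 1, every: int = 1):
        self.start, self.end = start, end    # permitted range [start, end)
        self.check = secrets.randbits(32)    # randomly generated check data
        self.increment = increment           # the "predefined amount"
        self.every = every                   # the "predefined number" of accepts
        self._accepted = 0

    def accept(self, check: int, addr: int, length: int) -> bool:
        if check != self.check:
            return False                     # missing, wrong or stale check data
        if addr < self.start or addr + length > self.end:
            return False                     # access outside the mapped range
        self._accepted += 1
        if self._accepted % self.every == 0:
            # Modify the check data according to the predefined scheme.
            self.check = (self.check + self.increment) % 2**32
        return True
```

With the default parameters a replayed message is rejected, because the expected
check value has already moved on by one after the first acceptance.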


According to the present invention there is also provided a communication system
including a communication interface as set out above and the said data processor.

The data processor is preferably capable of supporting an operating system and a
user application. The system preferably comprises a data store which stores items of
data defining operation parameters for communications over the data link to transmit
data stored in the first range or receive data for storage in the first range.

The operating system may be arranged to permit a user application to access one or
more items of data in the data store dependent on a level of trust granted to the
application.

The check data is preferably stored as one of the items of data in the data store, the
operating system is arranged to permit at least some user applications to have write
access to that item of data, and the communication interface is arranged to, in order
to determine the check data, read the contents of that item of data and treat it as the
check data.

Preferably items of data in the data store define the start and end points of the first
range of memory locations in the memory of the first data processing apparatus and
store the start and end points of the second range of memory locations in the
memory of the second data processing apparatus. Preferably the operating system
is arranged to permit applications having one or more levels of trust to have write
access to the items of data in the data store that store the start and end
points of the second range of memory locations in the memory of the second data
processing apparatus and to permit no applications to have write access to the items
of data in the data store that define the start and end points of the first range of
memory locations in the memory of the first data processing apparatus.

The present invention will now be described by way of example with reference to the
accompanying drawings.

Preferably the communication interface is capable of supporting a plurality of
mappings each of a respective first range of one or more virtual memory locations in
the second data processing apparatus on to a respective second range of one or
more memory locations in the first data processing apparatus, and for each such
mapping a respective further mapping of the respective one or more virtual memory
locations on to one or more physical memory locations in the memory of the first data
processing apparatus.



Preferably the communication interface includes a translation interface for translating
accesses to or from each of the said ranges of one or more virtual memory locations
into accesses to or from the respective one or more physical memory locations in the
memory of the first data processing apparatus and for translating accesses to or from
each of the one or more physical memory locations in the memory of the first data
processing apparatus into accesses to or from the respective ranges of one or more
virtual memory locations. Preferably the virtual memory locations are local bus
addresses, for example PCI bus addresses. An access to a location is suitably a
write access to that location. An access from a location is suitably a read access
from that location.



Preferably the communication interface comprises a mapping memory arranged to
store specifications of the said further mappings. The mapping memory preferably
comprises a first mapping memory local to the translation interface, and a second
mapping memory less local to the translation interface than the first mapping
memory, and the communication interface is arranged to store specifications
of all of the further mappings in the second mapping memory, and to store
specifications of only some of the further mappings in the first mapping memory.
Preferably the first mapping memory is an associative memory.
Preferably the translation interface is arranged to, in order to translate between an
access to or from one of the said ranges of one or more virtual memory locations and
an access to or from the respective one or more physical memory locations in the
memory of the first data processing apparatus, preferentially access the first mapping
memory to implement the translation, and if the specification of the mapping of the
range of virtual memory locations the subject of the access is not stored in the first
mapping memory to access the second mapping memory to implement the
translation.



Preferably the communication interface is arranged to store specifications of the most
recently used further mappings in the first mapping memory. Preferably it is arranged
to, if an attempt to access a specification from the first mapping memory is
unsuccessful, replace a specification in the first mapping memory with the
specification the attempt to access which was unsuccessful.

In the drawings: 

figure 1 illustrates mapping of one address space on to another over a
network;

figure 2 illustrates the architecture of a prior art memory mapped architecture;

figure 3 is a schematic diagram of a data transmission system;

figures 4 and 5 illustrate mapping of bits of an address;

figure 6 illustrates memory space apertures and their management domains;

figure 7 illustrates features of a port;

figure 8 illustrates a queue with control blocks;

figure 9 illustrates a dual queue mechanism;

figure 10 shows an example of an outgoing aperture table;

figure 11 shows an example of an incoming aperture table;

figure 12 shows the steps in a PCI write for an outgoing aperture; and

figure 13 illustrates the operation of pointers in fixed and variable length
queues.
Figure 3 is a schematic diagram of a data transmission system whereby a first data
processing unit (DPU) 20 can communicate with a second data processing unit 21
over a network link 22. Each data processing unit comprises a CPU 23, 24 which is
connected via a memory bus 25, 26 to a PCI controller 27, 28. The PCI controllers
control communications over respective PCI buses 29, 30, to which are connected
NICs 31, 32. The NICs are connected to each other over the network. Other similar
data processing units can be connected to the network to allow them to communicate
with each other and with the DPUs 20, 21. Local random access memory (RAM) 33,
34 is connected to each memory bus 25, 26.

The data transmission system described herein implements several significant
features: (1) dynamic caching of aperture mappings between the NICs 31, 32; (2) a
packet oriented setup and teardown arrangement for communication between the
NICs; and (3) the use of certain bits that are herein termed "nonce bits" in the
address space of one or both NICs.

Dynamic Caching of Aperture Entries

A small number of aperture mappings can be stored efficiently using a static table.
To implement this, a number of bits (the map bits) of an address are caught by the
address decode logic of an NIC and are used as an index into an array of memory
which contains the bits that are used for reversing the mapping (the remap bits). For
example, in a system of the type illustrated in figure 3 an NIC might receive over the
PCI bus 29 a request for reading or writing data at a specified local address. The
NIC stores a mapping that indicates the remote address that corresponds to that
local address, the transformation being performed by substituting one or more of the
bits of the local address. For example, the second and third nibbles of the address
could be substituted. In that case to access the remote address that corresponds to
a local address of 0x8210BEEC the NIC would access the mapping table, determine
the mapping for bits "21" (suppose that is bits "32") and then address the
corresponding remote address (in this example 0x8320BEEC). (See figure 4)
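
By way of illustration only, the substitution in this example might be sketched as follows. The names MAP_SHIFT, MAP_MASK, REMAP_TABLE and remap are hypothetical and do not appear in the specification; the sketch assumes a 32-bit address whose second and third nibbles (counting from the most significant end) are the map bits.

```python
# Illustrative sketch only: a static remap table indexed by the map bits.
# All names here are assumptions made for the example.
MAP_SHIFT = 20                 # second and third nibbles of a 32-bit address
MAP_MASK = 0xFF << MAP_SHIFT   # selects those two nibbles

# map bits -> remap bits; the single entry mirrors the "21" -> "32" example
REMAP_TABLE = {0x21: 0x32}


def remap(local_addr: int) -> int:
    """Replace the map bits of a local address with the stored remap bits."""
    map_bits = (local_addr & MAP_MASK) >> MAP_SHIFT
    remap_bits = REMAP_TABLE[map_bits]      # KeyError if nothing is mapped
    cleared = local_addr & ~MAP_MASK & 0xFFFFFFFF
    return cleared | (remap_bits << MAP_SHIFT)


print(hex(remap(0x8210BEEC)))  # 0x8320beec
```

The remaining bits of the address pass through unchanged, which is why the low-order part (0xBEEC) is the same in the local and the remote address.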
This method is scalable up to a few hundred or thousand entries depending on the
implementation technology used (typically FPGA or ASIC) but is limited by the space
available within the device that is used to hold the mapping table. A superior method
of implementation is to store the mappings in a larger store (to which access is
consequently slower) and to cache the most recently used mappings in an
associative memory that can be accessed quickly. If a match for the bits that are to
be substituted is found in the associative memory (by a hardware search operation)
then the remap is made very quickly. If no match is found the hardware must
perform a secondary lookup in the larger memory (in either a table or tree structure).
Typically the associative memory will be implemented on the processing chip of the
NIC, and the larger memory will be implemented off-chip, for example in DRAM. This
is illustrated in figure 5. This method is somewhat similar to the operation of a TLB
on a CPU; however here it is used for an entirely different function: i.e. for the
purpose of aperture mapping on a memory mapped network card.
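
The two-level lookup can be modelled in software as a small cache in front of a larger table. The following is a minimal sketch under stated assumptions: the class ApertureMapCache, its capacity and its least-recently-used replacement policy are illustrative choices, a dictionary stands in for the off-chip store, and an ordered dictionary stands in for the associative memory, which in hardware would be searched in parallel.

```python
from collections import OrderedDict


class ApertureMapCache:
    """Software model of a fast associative memory backed by a larger store."""

    def __init__(self, backing: dict, capacity: int = 4):
        self.backing = backing        # stands in for the larger, slower store
        self.capacity = capacity      # entries the fast memory can hold
        self.cache = OrderedDict()    # stands in for the associative memory
        self.hits = 0
        self.misses = 0

    def lookup(self, map_bits: int) -> int:
        """Return the remap bits, consulting the fast memory first."""
        if map_bits in self.cache:
            self.cache.move_to_end(map_bits)   # mark as most recently used
            self.hits += 1
            return self.cache[map_bits]
        # Miss: secondary lookup in the larger store, then cache the result.
        self.misses += 1
        remap_bits = self.backing[map_bits]
        if len(self.cache) >= self.capacity:
            self.cache.popitem(last=False)     # evict least recently used
        self.cache[map_bits] = remap_bits
        return remap_bits
```

With a capacity of two, looking up entries 1, 2, 1 and 3 in turn gives one hit and three misses, and entry 2 is the one displaced when entry 3 is loaded.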

In practice, the mapping information must contain all the address information
required to transmit a packet over a network. This is discussed in more detail below.

Packet oriented connection setup and tear down protocol



A protocol will now be described for establishing a connection between two
applications' address spaces using apertures, where there are two administration
domains (one belonging to each of the communicating hosts). The general
arrangement is illustrated in figure 6. In domain A there is a host A having a virtual
address space A and an NIC A that can access the virtual address space. In domain
B there is a host B having a virtual address space B and an NIC B that can access
the virtual address space. The NICs are connected together over a network.



In this example mapping entries for devices in domain A can only be set by the
operating system on host A. A further implementation in which an application A
running on host A is allowed to set some (but not all) bits on an aperture mapping
within domain A is described below.
The connection protocol to be described uses IP (Internet Protocol) datagrams to
transfer packets from one host to another (just as for standard Ethernet networks).
The datagrams are addressed as <host:port> where <host> is the network identifier
of the destination host and <port> is an identifier for the application (NB each
application may have a number of allocated ports corresponding to different network
connections) within the host. It will be appreciated that the present protocol could be
used over other transport protocols than IP.



In the present protocol the connection setup proceeds as follows, assuming host A
wishes to make an active connection to a passive (accepting) host B on which an
application B is running.

1. Application B publishes its accepting internet address <host B:port B>; this can
be accessed over the network in the normal way.

2. Application A (which for convenience will be referred to as host A) presents a
request to Operating System A for the creation of an incoming aperture onto
memory within host A to be used for communication. Once this aperture has
been defined its details are programmed on NIC A so that incoming network
writes that are directed to addresses in that virtual space will be directed onto
the corresponding real addresses in memory A. The aperture will be given a
reference address: in-index A.

3. The host A sends an IP datagram to <host B:port B> which contains the
connect message:

[CONNECT/in-index A]

Note that the full IP datagram will also contain source and destination IP
addresses (and ports), as normal.

4. The connect message is received by application B. The message may be
received either directly to user level or to the operating system (according to
the status of the dual event queue) as described later.

5. Host B recognises the message as being a request to connect to B, offering
the aperture in-index A. Using rules pre-programmed at B (typically for
security reasons), host B will decide whether to reject or accept the connection.
If B decides to accept the connection it creates an (or uses a pre-created)
incoming aperture which is mapped onto memory B and is given reference
address: in-index B. Host B may choose to create a new port for the
connection: port' B. Host B sends back to host A an accept message as an IP
datagram:

[ACCEPT/port' B/in-index B]

Note that the full IP datagram will also contain source and
destination IP addresses (and ports), as normal.

Once this has been received, each host has created an aperture, each NIC is set
up to perform the mapping for requests to read or write in that aperture, and each
host knows the reference address of the other host's aperture.

6. Following the messaging discussed so far, both hosts create outgoing
apertures. A creates an aperture which maps application A's virtual address
space onto NIC A outgoing aperture out-index A. This outgoing aperture
maps onto [host B:in-index B] which maps onto memory B. Host B creates a
similar outgoing aperture out-index B which maps onto memory A. By this
means, bi-directional communication is possible through the memory mapped
regions. At any time the applications may send a message to the port, which
is associated with the memory mapping. These may be used to guarantee out
of band data for example:

(i) A CLOSE message to indicate that the connection and hence memory
mappings should be closed down.

(ii) An ALIVE message to request a response from a non-responding
application [ALIVEACK would be the response].

(iii) An ERROR message which is generated by any hardware element on
the data path which has detected a data transfer error. This message
is important because it allows feedback to be provided from the
memory mapped interface.
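
The exchange of steps 2 to 6 can be followed in a small software model. This is a sketch under stated assumptions rather than the protocol implementation: the Host class, the connect and write functions, the tuple message layout and the accept_policy callback are all hypothetical, hosts are modelled as objects in one process, and no datagrams are actually sent.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    name: str
    incoming: dict = field(default_factory=dict)   # in-index -> memory buffer
    outgoing: dict = field(default_factory=dict)   # out-index -> (host, in-index)

    def create_incoming(self, size: int = 16) -> int:
        """Create an incoming aperture onto local memory; return its in-index."""
        in_index = len(self.incoming)
        self.incoming[in_index] = bytearray(size)
        return in_index

    def create_outgoing(self, remote: str, remote_in_index: int) -> int:
        """Create an outgoing aperture mapped onto a remote incoming aperture."""
        out_index = len(self.outgoing)
        self.outgoing[out_index] = (remote, remote_in_index)
        return out_index


def connect(host_a: Host, host_b: Host, accept_policy=lambda msg: True):
    """Model steps 2-6; return (out-index A, out-index B) or None if rejected."""
    in_a = host_a.create_incoming()                     # step 2
    connect_msg = ("CONNECT", host_a.name, in_a)        # step 3
    if not accept_policy(connect_msg):                  # step 5: rules at B
        return None
    in_b = host_b.create_incoming()
    accept_msg = ("ACCEPT", host_b.name, in_b)
    out_a = host_a.create_outgoing(host_b.name, accept_msg[2])    # step 6
    out_b = host_b.create_outgoing(host_a.name, connect_msg[2])
    return out_a, out_b


def write(hosts: dict, sender: Host, out_index: int, offset: int, data: bytes):
    """Write through an outgoing aperture into the remote host's memory."""
    remote, in_index = sender.outgoing[out_index]
    buf = hosts[remote].incoming[in_index]
    if offset < 0 or offset + len(data) > len(buf):
        raise ValueError("write falls outside the aperture")
    buf[offset:offset + len(data)] = data
```

After connect returns, a write by A through out-index A lands in memory B and a write by B through out-index B lands in memory A, which is the bi-directional arrangement of step 6.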
Note that where an application already has a virtual address mapping onto an
outgoing aperture, step 6 reduces to a request for the NIC to map the outgoing
aperture onto a particular host's incoming aperture. This is described further in terms
of user level connection management below.



Dual Event Queues 

In the present context a port will be considered to be an operating system specific
entity which is bound to an application, has an address code, and can receive
messages. This concept is illustrated in figure 7. One or more incoming messages
that are addressed to a port form a message queue, which is handled by the
operating system. The operating system has previously stored a binding between
that port and an application running on the operating system. Messages in the
message queue for a port are processed by the operating system and provided by
the operating system to the application to which that port is bound. The operating
system can store multiple bindings of ports to applications so that incoming
messages, by specifying the appropriate port, can be applied to the appropriate
application.



The port exists within the operating system so that messages can be received and
securely handled no matter what the state of the corresponding application. It is
bound (tethered) to a particular application and has a message queue attached. In
traditional protocol stacks, e.g. in-kernel TCP/IP, all data is normally enqueued on the
port message queue before it is read by the application. (This overhead can be
avoided by the memory mapped data transfer mechanism described herein).

In the scheme to be described herein, only out of band data is enqueued on the port
message queue. Figure 7 illustrates this for a CONNECT message. In figure 7, an
incoming packet E, containing a specification of a destination host and port (field 50),
a message type (field 51) and an index (field 52), is received by NIC 53. Since this
data is a CONNECT message it falls into the class of out of band data. However, it is
still applied to the message queue 54 of the appropriate port 55, from where it can be
read by the application that has been assigned by the operating system to that port.

A further enhancement is to use a dual queue, associated with a port. This can help
to minimise the requirements to make system calls when reading out of band
messages. This is particularly useful where there are many messages e.g. high
connection rate as for a web server, or a high error rate which may be expected for
Ethernet.



At the beginning of its operations, the operating system creates a queue to handle
out of band messages. This queue may be written to by the NIC and may have an
interrupt associated with it. When an application binds to a port, the operating
system creates the port and associates it with the application. It also creates a
queue to handle out of band messages for that port only. That out of band message
queue for the port is then memory mapped into the application's virtual address
space such that it may de-queue events without requiring a kernel context switch.

The event queues are registered with the NIC, and there is a control block on the NIC
associated with each queue (and mapped into either or both the OS or application's
address space(s)).

A queue with control blocks is illustrated in figure 8. The queue 59 is stored in
memory 60, to which the NIC 61 has access. Associated with the queue are a read
pointer (RDPTR) 62a and a write pointer (WRPTR) 63a, which indicate the points in
the queue at which data is to be read and written next. Pointer 62a is stored in
memory 60. Pointer 63a is stored in NIC 61. Mapped copies of the pointers:
RDPTR' 62b and WRPTR' 63b are stored in the other of the NIC and the memory than
the original pointers. In the operation of the system:

1. The NIC can determine the space available for writing by comparing RDPTR' and
WRPTR, which it stores locally.

2. The NIC generates out of band data when it is received in a datagram and writes it
to the queue 59.





3. The NIC updates WRPTR and WRPTR' when the data has been written, so that
the next data will be written after the last data.

4. The application determines the space available for reading by comparing RDPTR
and WRPTR' as accessed from memory 60.

5. The application reads the out of band data from queue 59 and processes the
messages.

6. The application updates RDPTR and RDPTR'.

7. If the application requires an interrupt, then it (or the operating system on its
behalf) sets the IRQ 65a and IRQ' 65b bits of the control block 64. The control
block is stored in memory 60 and is mapped onto corresponding storage in the
NIC. If set, then the NIC would also generate an interrupt on step 3.
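
The pointer handling of steps 1 to 7 can be sketched in software as follows. This is an illustrative model only: the class OobQueue and its method names are hypothetical, the pointers are assumed to be free-running counters reduced modulo the queue size when indexing, and the mapped copies are updated by simple assignment where real hardware would write across the bus.

```python
class OobQueue:
    """Model of a queue whose read and write pointers each have a mapped copy."""

    def __init__(self, size: int):
        self.size = size
        self.slots = [None] * size   # queue 59, held in host memory
        self.rdptr = 0               # RDPTR, original in host memory
        self.wrptr = 0               # WRPTR, original on the NIC
        self.rdptr_copy = 0          # RDPTR', mapped copy on the NIC
        self.wrptr_copy = 0          # WRPTR', mapped copy in host memory
        self.irq = False             # IRQ bit of the control block
        self.interrupts = 0          # count of interrupts raised by the NIC

    # NIC side: uses only values held locally (WRPTR and RDPTR')
    def nic_space(self) -> int:
        return self.size - (self.wrptr - self.rdptr_copy)        # step 1

    def nic_write(self, msg) -> bool:
        if self.nic_space() == 0:
            return False
        self.slots[self.wrptr % self.size] = msg                 # step 2
        self.wrptr += 1                                          # step 3
        self.wrptr_copy = self.wrptr
        if self.irq:                                             # step 7
            self.interrupts += 1
        return True

    # Application side: uses only values in host memory (RDPTR and WRPTR')
    def app_available(self) -> int:
        return self.wrptr_copy - self.rdptr                      # step 4

    def app_read(self):
        if self.app_available() == 0:
            return None
        msg = self.slots[self.rdptr % self.size]                 # step 5
        self.rdptr += 1                                          # step 6
        self.rdptr_copy = self.rdptr
        return msg
```

Because each side consults only its local pointer and the mapped copy of the other side's pointer, neither the NIC nor the application needs to read across the bus to decide whether it may proceed.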

If an interrupt is generated, then firstly the PCI interrupt line is asserted to ensure the
computer's interrupt handler is executed, but also a second message is delivered into
the operating system's queue. In general, this queue can handle many interrupt
types, such as hardware failure, but in this case the OS queue contains the following
message [OOBDATA:PORT] indicating that out of band data has been delivered to
the application queue belonging to [PORT]. The OS can examine the data in queue
59 and take appropriate action. The usual situation will be that the application is
blocked or descheduled and the OS must wake it (mark as runnable to the
scheduler).

This dual queue mechanism enables out of band data to be handled by the
application without involving the OS while the application is running. Where the
application(s) is blocked, the second queue and interrupt enable the OS to determine
which of potentially many application queues have had data delivered. The overall
arrangement is illustrated in figure 9.

The out of band (OOB) queue holds out of band data, which are:

1. Error events associated with the port

2. Connection setup messages and other signalling messages from the network
and other applications

3. Data delivery events, which may be generated either by the sending
application, the NIC or the receiving OS.



If the queue is to contain variable sized data then the size of the data part of each
message must be included at the start of the message.



When applications are to communicate in the present system over shared memory, a
single work queue can be shared between two communicating endpoints using
non-coherent shared memory. As data is written into the queue, write pointer (WRPTR)
updates are also written by the transmitting application into the remote
network-mapped memory to indicate the data valid for reading. As data is removed from the
queue, read pointer (RDPTR) updates are written by the receiving application back
over the network to indicate free space in the queue.

These pointer updates are conservative and may lag the reading or writing of data by
a short time, but this means that a transmitter will not initiate a network transfer of data
until buffer space is available at the receiver, and the low latency of the pointer updates
means that the amount of queue buffer space required to support a pair of
communicating endpoints is small. The event mechanism described above can be
used to allow applications to block on full/empty queues and to manage large
numbers of queues via a multiplexed event stream, which is scalable in terms of CPU
usage and response time.

Variable length data destined for an event queue would be delivered to a second
queue. This has the advantage of simplifying the event generation mechanism in
hardware. Thus the fixed size queue contains simple events and pointers (size) into
the variable length queue.

1. As shown in figure 13, the difference between RDPTR1 and WRPTR1 indicates
the valid events in the queue, and also the number of events because they are
of fixed size.

2. The event Var 10 (for illustration) indicates that a variable sized event of size
10 words has been placed on the variable sized queue.

3. The difference between WRPTR2 and RDPTR2 indicates only the number of
words which are in the variable sized queue, but the application is able to
dequeue the first event in its entirety by removing 10 words.

4. The application indicates processing of an event to the NIC by updating the
RDPTR on the NIC's memory

(a) for the static queue by the number of events processed multiplied by
the size of each event

(b) for the variable sized queue by the number of words consumed (i.e.
the same for both cases)

5. The data on the variable length queue may also contain the size (e.g. if it is a
UDP/IP packet)
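
The relationship between the two queues can be illustrated with a short sketch. The class EventQueues and its fields are hypothetical; for brevity the queues are modelled as growing lists rather than circular buffers, and each fixed size event is assumed to occupy one slot.

```python
class EventQueues:
    """Fixed size event queue paired with a variable length word queue."""

    def __init__(self):
        self.fixed = []                 # fixed size events, e.g. ("VAR", 10)
        self.var = []                   # words of variable length data
        self.rdptr1 = self.wrptr1 = 0   # pointers into the fixed size queue
        self.rdptr2 = self.wrptr2 = 0   # pointers into the variable length queue

    def deliver(self, words):
        """Place data on the variable queue and a fixed size event describing it."""
        self.var.extend(words)
        self.wrptr2 += len(words)
        self.fixed.append(("VAR", len(words)))
        self.wrptr1 += 1

    def pending_events(self) -> int:
        return self.wrptr1 - self.rdptr1     # number of events (fixed size)

    def pending_words(self) -> int:
        return self.wrptr2 - self.rdptr2     # words only, not event boundaries

    def dequeue(self):
        """Remove one event in its entirety, using the size from the fixed queue."""
        _kind, size = self.fixed[self.rdptr1]
        payload = self.var[self.rdptr2:self.rdptr2 + size]
        self.rdptr1 += 1                     # advance by one event
        self.rdptr2 += size                  # advance by the words consumed
        return payload
```

After delivering a 10-word event followed by a 3-word event, the variable length queue reports 13 pending words with no indication of where one event ends; it is the Var 10 entry on the fixed size queue that lets the application remove exactly the first event.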

Enhanced Aperture Mappings and "Nonce" Bits

In this implementation, additional bits, termed "nonce bits", are provided in order to
protect against malfunctioning or malicious hardware or software writing
inadvertently to apertures. To illustrate this, the following network mapping will be
discussed:



<virtual memory address> -> <PCI address> -> <host:in-index> -> ...
... <network packet> -> <PCI address> -> <physical memory address> -> ...
... <virtual memory address>



When performing the mapping to <host:in-index> the NIC is able to create an
outgoing packet which is addressed by <host:in-index>. This will be recognized
by the NIC that receives the packet as being a packet intended for processing as
an aperture packet, rather than as a packet intended to pass via a port to a
corresponding application. Thus the packet is to be presented to the incoming
aperture lookup hardware.



It should first be noted that under the scheme described above, the PCI address
to which the data is sent encodes both the aperture mapping and an offset within
the aperture. This is because the NIC can form the destination address as a
function of the address to which the message on the PCI bus was formed. The
address received by the NIC over the PCI bus can be considered to be formed of
(say) 32 bits which include an aperture definition and a definition of an offset in
that aperture. The offset bits are also encoded in the outgoing packet to enable
the receiving NIC to write the data relative to the incoming aperture base. In the
case of a data write the resulting network packet can be considered to comprise
data together with a location definition comprising an offset, an in-index and an
indication of the host to which it is addressed. At the receiving NIC at the host
this will be considered as instructing writing of the data to the PCI address that
corresponds to that aperture, offset by the received offset. In the case of a read
request the analogous operation occurs. This feature enables an aperture to be
utilized as a circular queue (as described previously) between the applications
and avoids the requirement to create a new aperture for each new receive data
buffer.

In this implementation the network packet also contains the nonce bits. These
are programmed into the aperture mapping during connection setup and are
intended to provide additional security, enabling apertures to be reused safely for
many connections to different hosts.

The processing of the nonce bits for communications between hosts A and B is
as follows:

1. At host A a random number is selected as nonce A.

2. Nonce A is stored in conjunction with an aperture, in-index A.

3. A connect message is sent to host B to set up communications in the way
generally as described above. In this example the message also includes
nonce A. Thus the connect message includes port B, in-index A, nonce A.

4. On receiving the connect message host B stores in-index A and nonce A in
conjunction with outgoing aperture B.

5. Host B selects a random number as nonce B.

6. Nonce B is stored in conjunction with an aperture, in-index B.

7. An accept message is sent to host A to accept the set up of
communications in the way generally as described above. In this example
the message also includes nonce B. Thus the accept message includes
port B', in-index B, nonce B.

8. Host A stores in-index B and nonce B in conjunction with outgoing aperture
A.

Once the connection is set up to include the nonce bits all packets sent from A to B
via outgoing aperture A will contain nonce B. When received the NIC B will look up
in-index B and compare the received nonce value with that programmed at B. If they
differ, the packet is rejected. This is very useful if a malfunctioning application holds
onto a stale connection: it may transmit a packet which has a valid [host:in-index]
address, but would have old nonce bits, and so would be rejected.
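
The rejection of stale packets can be illustrated with a minimal sketch. The class IncomingApertureTable, its methods and the 16-bit nonce width are assumptions made for the example; the specification does not fix the number of nonce bits or the source of randomness.

```python
import secrets


class IncomingApertureTable:
    """Model of the per-aperture nonce check performed by the receiving NIC."""

    def __init__(self):
        self.nonces = {}    # in-index -> nonce currently programmed

    def program(self, in_index: int, nonce=None, bits: int = 16) -> int:
        """Store a nonce for an aperture at connection setup; return it."""
        if nonce is None:
            nonce = secrets.randbits(bits)   # random selection, as in step 1
        self.nonces[in_index] = nonce
        return nonce

    def accept(self, in_index: int, nonce: int) -> bool:
        """Accept a packet only if its nonce matches the programmed value."""
        return self.nonces.get(in_index) == nonce
```

If an aperture is reprogrammed with a fresh nonce for a new connection, a packet from the previous connection still carries a valid in-index but fails the comparison, which is the stale connection case described above.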

Remembering that the user level application has a control block for the out of band
queue, this control block can also be used to allow control of the apertures
associated with the application, in such a way that connection setup and tear down
may be performed entirely at user level.

Note that only some parts of the aperture control block are user programmable;
others must only be programmed by the operating system.

• User Programmable bits include: nonce bits, destination host bits

• O/System Programmable bits include:

a) base address of incoming aperture (this prevents an application from
corrupting memory buffers by mistake or malintent)

b) source host bits (this prevents an application from masquerading as
originating from another host)

For an untrusted application, kernel connection management would be performed.
This means that out of band data would be processed only in the kernel, and no
programmable bits would be made available to the application.
An example of an outgoing aperture table is shown in figure 10. Each row of the
table represents an aperture and indicates the attributes of that aperture. It should
be noted that:

1. A number of aperture sizes may be supported. These will be grouped such that
the base address also gives the size of the aperture. Alternatively, a size field
can be included in the aperture table.

2. The type field indicates the Ethernet type to use for the outgoing packet. It also
indicates whether the destination address is a 4 byte IPv4 address or a 16 bit
cluster address. (IPv6 addresses or other protocol addresses could equally be
accommodated.) The type field also distinguishes between event and data
packets within the cluster. (An event packet will result in a fixed size event
message appearing on the destination's event queue).

3. The PCI base address is OS programmable only; other fields may be
programmed by the application at user level depending on the system's security
policy.

4. Source Ethernet address, Source IP and Cluster address and possibly other
information is common to all entries and stored in per NIC memory.

5. In all cases addressing of the outgoing Ethernet packet is either

<Ethernet MAC><IP host : IP port> (in the case of a TCP/IP packet)

or

<Ethernet MAC><CI host : CI in-index : CI nonce : CI aperture offset> (in the
case of a CI (computer interface) packet)
(n.b. the offset is derived from the PCI address issued).

6. Each aperture is allocated an initial sequence number. This is incremented by the
hardware as packets are processed and is optionally included in cluster address
formats.



An example of an incoming aperture table is shown in figure 11. Each row of the
table represents an aperture and indicates the attributes of that aperture. The
incoming aperture is essentially the reverse of the outgoing aperture. It should be
noted that:
1. As well as the size being optionally encoded by having fixed size tables, the
EthType can be optionally encoded by grouping separate aperture tables.

2. The sequence number fields are optional and the receiver can set

(a) whether sequence checking should be done

(b) the value of the initial sequence number

If done this must also be communicated as part of the connection protocol,
which could conveniently be performed in a similar way to the communication
of nonce values from one host to another.

3. Similarly to outgoing apertures, some information is per-NIC e.g. IP address,
Ethernet address.

4. For application level robustness it is possible to "narrow" down an aperture by
specifying an address and size which specifies a range which lies within the
default range. This might be done when the application level data structure is
of a size smaller, or different alignment, than the default aperture size and fine
grained memory protection is required.

5. The map address is either the PCI address which the NIC should emit in order
to write to memory for the aperture, or else a local (to the NIC's SRAM) pointer
to the descriptor for the event queue.



A PCI write for an outgoing aperture is processed as shown in figure 12. The steps
are as follows.

1. A PCI burst is emitted whose address falls within the range allocated to the
NIC.

2. The NIC's address decoder captures the burst and determines that the
address is within the range of the apertures. (It could otherwise be a local
control write).

3. Depending on the aperture size (which is coarsely determined from the
address), the address is split into <base:offset>. E.g. for a 1k aperture, the
bottom 10 bits would be the offset. The base is fed into the aperture table
cache to match the required packet header information.

4. Depending on the Ethernet packet type field either an IP/Ethernet or
CI/Ethernet packet header is formed.
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5. The C.f packet would ^ 

' Data (containing the data pa^baij-pf the PCI burst) 1 ' 
■^.Checksum (calculated " b^ hardwarefbver the contents of the header) - : 
. ,Qffeet a (by the address-decoder)'.^!;. 

Sequence number ' !/. ,J§|;* 

Nonce ' p 

Aperture index 

CI Host cluster address 

6. If a number of PCI bursts arrive for a pa^iilar host, then they may be packed 
into a single Ethernet frame with compression techniques applied to remove 
redundant header information j ^| 

7. In the present system a system-specificpRC or checksum is used to provide 



end-to-end protection and is appended^t^; the data portion of the packet. 
Although the Ethernet packet also contain^'a CRC, it may be removed and 



recalculated on any hop (e.g. at a switcKi)|and so does not provide protection 
against internal (e.g. switch-specific) cormp|ions. 



8. If the sequence number is applied, then it is incremented and written back to
the aperture table entry.
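
The outgoing steps above can be sketched in software as follows. This is only an illustrative model, not the hardware design: the class `ApertureEntry`, the function names, the 32-bit field widths, the field order on the wire and the use of CRC-32 (via `zlib.crc32`) for both the header checksum and the end-to-end check are all assumptions introduced for the example.

```python
import struct
import zlib
from dataclasses import dataclass


@dataclass
class ApertureEntry:
    """Hypothetical aperture table entry; names and widths are assumptions."""
    host_address: int      # CI host cluster address of the destination
    aperture_index: int    # index of the incoming aperture at the receiver
    nonce: int
    sequence: int = 0


def split_address(addr, aperture_bits=10):
    """Step 3: split a PCI address into (base, offset).

    A 1k aperture has aperture_bits == 10, so the bottom 10 bits are the offset.
    """
    return addr >> aperture_bits, addr & ((1 << aperture_bits) - 1)


def build_ci_packet(table, addr, payload, aperture_bits=10):
    """Steps 3 to 8 for one PCI burst; returns the CI packet as bytes."""
    base, offset = split_address(addr, aperture_bits)
    entry = table[base]  # the base selects the aperture table entry
    header = struct.pack("!5I", entry.host_address, entry.aperture_index,
                         entry.nonce, entry.sequence, offset)
    header_checksum = struct.pack("!I", zlib.crc32(header))
    # Step 8: increment the sequence number and write it back to the entry.
    entry.sequence = (entry.sequence + 1) & 0xFFFFFFFF
    body = header + header_checksum + payload
    # Step 7: end-to-end check appended to the data portion.
    return body + struct.pack("!I", zlib.crc32(body))
```

A receiver can recompute the trailing check over everything that precedes it; unlike the Ethernet CRC, this value is not regenerated at intermediate hops.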



For incoming packets, the reverse operation takes place. The incoming aperture is
looked up and checked to be:

(a) valid;

(b) sequence number expected matches that of the packet;

(c) nonce matches (or port);

(d) expected Ethernet source address;

(e) expected IP or CI source addresses (which may be specified as a netmask
to allow a range of source addresses to be matched).



Any one or more of these checks may be implemented or omitted, depending on the 
level of security required.
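
As a rough illustration of checks (a) to (e), the sketch below models the incoming aperture lookup result and the packet header as simple records. The record types, field names and the `checks` parameter (which selects the checks that are implemented) are assumptions made for the example; addresses are treated as 32-bit integers.

```python
from dataclasses import dataclass


@dataclass
class IncomingAperture:
    """Hypothetical incoming aperture table entry."""
    valid: bool
    expected_sequence: int
    nonce: int
    eth_source: bytes   # expected Ethernet source address
    src_network: int    # expected IP or CI source address
    src_netmask: int    # netmask allowing a range of sources to match


@dataclass
class PacketHeader:
    """Hypothetical view of the fields carried by an incoming packet."""
    sequence: int
    nonce: int
    eth_source: bytes
    src_address: int


def check_packet(ap, hdr, checks=frozenset("abcde")):
    """Return None if the packet passes, else the letter of the first failed check."""
    if "a" in checks and not ap.valid:
        return "a"
    if "b" in checks and hdr.sequence != ap.expected_sequence:
        return "b"
    if "c" in checks and hdr.nonce != ap.nonce:
        return "c"
    if "d" in checks and hdr.eth_source != ap.eth_source:
        return "d"
    if "e" in checks and (hdr.src_address & ap.src_netmask) != (
            ap.src_network & ap.src_netmask):
        return "e"
    return None
```

Passing a smaller set, for example `checks=frozenset("ac")`, models a configuration in which only the validity and nonce checks are implemented.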
This lookup returns a field of: (base + extent) for the aperture. The offset is checked
against the extent to ensure out of aperture access is not made and a PCI write is
formed and emitted on the receiver's PCI bus with the format

    address: base + offset
    data:    DATA1, DATA2, ...

If the PCI bus is stalled (say on DATAn), a new PCI transaction will be emitted:

    address: base + offset + N
    data:    DATAn, DATAn+1, ...
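
The extent check and the restart after a stall might be modelled as below. This is a sketch under stated assumptions: the extent is taken to be the aperture size in bytes, data is byte-addressed, and `stall_after` is a hypothetical parameter giving the number of bytes accepted before the bus stalled.

```python
def pci_writes(base, extent, offset, data, stall_after=None):
    """Return the PCI write transactions as a list of (address, data) pairs.

    Raises ValueError if the access would fall outside the aperture.
    """
    if offset < 0 or offset + len(data) > extent:
        raise ValueError("access outside aperture")
    if stall_after is None or not 0 < stall_after < len(data):
        return [(base + offset, data)]
    # The bus stalled part-way: resume with a new transaction whose address
    # is advanced by the amount of data already written.
    return [(base + offset, data[:stall_after]),
            (base + offset + stall_after, data[stall_after:])]
```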



Similarly, if consecutive CI data packets arrive, they may be coalesced into larger PCI
bursts simply by removing the redundant intermediate headers.
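
A minimal sketch of such coalescing is given below. It assumes each received packet has already been reduced to a hypothetical `(aperture_index, offset, data)` tuple, and it merges a packet into the previous burst only when it targets the same aperture and continues at the next offset.

```python
def coalesce(packets):
    """Merge consecutive packets that are contiguous within the same aperture."""
    bursts = []
    for aperture, offset, data in packets:
        if bursts:
            last_aperture, last_offset, last_data = bursts[-1]
            if last_aperture == aperture and last_offset + len(last_data) == offset:
                # The intermediate header is redundant: extend the current burst.
                bursts[-1] = (last_aperture, last_offset, last_data + data)
                continue
        bursts.append((aperture, offset, data))
    return bursts
```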

The applicant hereby discloses in isolation each individual feature described herein
and any combination of two or more such features, to the extent that such features or
combinations are capable of being carried out based on the present specification as
a whole in the light of the common general knowledge of a person skilled in the art,
irrespective of whether such features or combinations of features solve any problems
disclosed herein, and without limitation to the scope of the claims. The applicant
indicates that aspects of the present invention may consist of any such individual
feature or combination of features. In view of the foregoing description it will be
evident to a person skilled in the art that various modifications may be made within
the scope of the invention.

CLAIMS

1. A communication interface for providing an interface between a data link and a
data processor, the data processor being capable of supporting an operating system
and a user application; the communication interface being arranged to:

support a first queue of data received over the link and addressed to a logical
data port associated with a user application;

support a second queue of data received over the link and identified as being
directed to the operating system; and

analyse data received over the link and identified as being directed to the
operating system or the data port to determine whether that data meets one or more
predefined criteria, and if it does meet the criteria transmit an interrupt to the
operating system.




2. A communication interface as claimed in claim 1, wherein the user application has
an address space and the first queue is located in that address space.



3. A communication interface as claimed in claim 1 or 2, wherein the operating
system has an address space and the second queue is located in that address
space.

4. A communication interface as claimed in claim 3 as dependent on claim 2,
wherein the user application and the operating system have the same address
space.

5. A communication interface as claimed in any preceding claim, the communication
interface being arranged to apply to the first queue data received over the link and
identified as being directed to the data port.

6. A communication interface as claimed in any preceding claim, the communication
interface being arranged to apply to the second queue data received over the link and
identified as being directed to the operating system.

7. A communication interface as claimed in any preceding claim, wherein one of the
predefined criteria is such that if the data received over the link matches one or more
predetermined message forms then the communication interface will transmit an
interrupt to the operating system.



8. A communication interface as claimed in any preceding claim, wherein the
communication interface is arranged to, if the data meets one or more of the
predefined criteria and one or more additional criteria, transmit an interrupt to the
operating system and transmit a message to the operating system indicating a port to
which the data was addressed.

9. A communication interface as claimed in claim 8, wherein the additional criteria
are indicative of an error condition.

10. A communication interface as claimed in any preceding claim, wherein the
communication interface is arranged to support a third queue of data received over
the link and addressed to a logical data port associated with a user application, and
is arranged to apply to the first queue data units received over the link and of a form
having a fixed length and to apply to the third queue data units received over the link
and of a form having a variable length.

11. A communication interface as claimed in claim 10, wherein the data units of a
fixed size include messages received over the link and interpreted by the
communication interface as indicating an error status.

12. A communication interface as claimed in claim 10 or 11, wherein the data units
of a fixed size include messages received over the link and interpreted by the
communication interface as indicating a request for or acknowledgement of set-up of
a connection.

13. A communication interface as claimed in any of claims 10 to 12, wherein the
data units of a fixed size include messages received over the link and interpreted by
the communication interface as indicating a data delivery event.



14. A communication interface as claimed in any preceding claim, wherein the
communication interface is arranged to analyse the content of each data unit
received over the link and to determine in dependence on the content of that data
unit which of the said queues to apply the data unit to.



15. A communication interface as claimed in any preceding claim, wherein the
communication interface is configurable by the operating system to set the said
criteria.



16. A communication interface as claimed in any preceding claim, wherein one or
both of the communication interface and the operating system is responsive to a
message of a predetermined type to return a message including information
indicative of the status of the port.



17. A communication system including a communication interface as claimed in
claim 16, and the data processor, the data processor being arranged to, when the
processing of an application with which a data port is associated is suspended, set
the criteria such that the communication interface will transmit an interrupt to the
operating system on receiving data identified as being directed to that data port.

18. A communication interface for providing an interface between a data link and first
data processing apparatus including a memory, the data interface being such that a
region of the memory of the first data processing apparatus can be mapped on to
memory of a second data processing apparatus connected to the communication
interface by the link, the communication interface being arranged to, on establishing
a mapping of a first range of one or more memory locations in the second data
processing apparatus on to a second range of one or more memory locations in the
first data processing apparatus, transmit to the second data processing apparatus
data identifying the first range of memory locations.

19. A communication interface as claimed in claim 18, wherein the said one or
more memory locations in the memory of the first data processing apparatus are one
or more virtual memory locations and the communication interface is arranged to, on
establishing the said mapping, establish a further mapping of the one or more virtual
memory locations on to one or more physical memory locations in the memory of the
first data processing apparatus.



20. A communication interface as claimed in claim 18 or 19, wherein the
communication interface is arranged to, on establishing a mapping of a first range of
one or more memory locations in the memory of the second data processing
apparatus on to a second range of one or more memory locations in the memory of
the first data processing apparatus, allocate an identity to that mapping and transmit
that identity to the second data processing apparatus.

21. A communication interface as claimed in any of claims 18 to 20, wherein the
communication interface is capable of communicating by means of data messages
which specify a destination port to which data they contain is to be applied.



22. A communication interface as claimed in any of claims 18 to 21, wherein the
communication interface is arranged to, on establishing a mapping of a first range of
one or more memory locations in the memory of the second data processing
apparatus on to a second range of one or more memory locations in the memory of
the first data processing apparatus, determine check data and transmit the check
data to the second data processing apparatus, and wherein the communication
interface is arranged to reject subsequent communications over the mapping which
do not indicate the check data.




23. A communication interface as claimed in claim 22, wherein the check data is
randomly generated by the communication interface.

24. A communication interface as claimed in claim 22 or 23, wherein to indicate the
check data a communication includes the check data.

25. A communication interface as claimed in any of claims 22 to 24, wherein the
communication interface is arranged to modify the check data, according to a
predefined scheme, during the operation of the mapping.

26. A communication interface as claimed in claim 25, wherein the check data
represents a number and the predefined scheme is to increment the number
represented by the check data by a predefined amount each time a predefined
number of communications over the mapping are accepted.

27. A communication interface as claimed in any of claims 18 to 26, wherein the
communication interface is arranged to reject subsequent communications over the
mapping which indicate a request for access to data outside the first range.

28. A communication interface as claimed in claim 19 or any of claims 20 to 27 as
dependent on claim 19, wherein the communication interface is capable of
supporting a plurality of mappings each of a respective first range of one or more
virtual memory locations in the second data processing apparatus on to a respective
second range of one or more memory locations in the first data processing
apparatus, and for each such mapping a respective further mapping of the respective
one or more virtual memory locations on to one or more physical memory locations in
the memory of the first data processing apparatus.

29. A communication interface as claimed in claim 28, comprising a translation
interface for translating accesses to or from each of the said ranges of one or more
virtual memory locations into accesses to or from the respective one or more physical
memory locations in the memory of the first data processing apparatus and for
translating accesses to or from each of the one or more physical memory locations in
the memory of the first data processing apparatus into accesses to or from the
respective ranges of one or more virtual memory locations.



30. A communication interface as claimed in claim 29, comprising a mapping
memory arranged to store specifications of the said further mappings.



31. A communication interface as claimed in claim 30, wherein the mapping memory
comprises a first mapping memory local to the translation interface, and a second
mapping memory less local to the translation interface than the first mapping
memory and wherein the communication interface is arranged to store specifications
of all of the further mappings in the second mapping memory, and to store
specifications of only some of the further mappings in the first mapping memory.

32. A communication interface as claimed in claim 31, wherein the first mapping
memory is an associative memory.



33. A communication interface as claimed in claim 31 or 32, wherein the translation
interface is arranged to, in order to translate between an access to or from one of the
said ranges of one or more virtual memory locations and an access to or from the
respective one or more physical memory locations in the memory of the first data
processing apparatus, preferentially access the first mapping memory to implement
the translation, and if the specification of the mapping of the range of virtual memory
locations the subject of the access is not stored in the first mapping memory to
access the second mapping memory to implement the translation.

34. A communication interface as claimed in any of claims 31 to 33, wherein
the communication interface is arranged to store specifications of the most recently
used further mappings in the first mapping memory.

35. A communication system including a communication interface as claimed in any
of claims 18 to 34, and the data processor, the data processor being capable of
supporting an operating system and a user application, and the system comprising a
data store which stores items of data defining operation parameters for
communications over the data link to transmit data stored in the first range or receive
data for storage in the first range.



36. A communication system as claimed in claim 35, wherein the operating system is
arranged to permit a user application to access one or more items of data in the data
store dependent on a level of trust granted to the application.

37. A communication system as claimed in claim 36 as dependent on claim 22,
wherein the check data is stored as one of the items of data in the data store, the
operating system is arranged to permit at least some user applications to have write
access to that item of data, and the communication interface is arranged to, in order
to determine the check data, read the content of that item of data and treat it as the
check data.

38. A communication system as claimed in claim 36 or 37, wherein items of data in
the data store define the start and end points of the first range of memory locations in
the memory of the first data processing apparatus and store the start and end points
of the second range of memory locations in the memory of the second data
processing apparatus, and the operating system is arranged to permit applications
having one or more levels of trust to have write access to the items of data in the
data store that store the start and end points of the second range of
memory locations in the memory of the second data processing apparatus and to
permit no applications to have write access to the items of data in the data store that
define the start and end points of the first range of memory locations in the memory
of the first data processing apparatus.

39. A communication interface as claimed in any of claims 1 to 16 and as claimed in
any of claims 18 to 34.



40. A communication system as claimed in claim 17 and as claimed in any of claims
35 to 38.






